Use URI scheme instead of FileSystem type to write the correct value in the DB #726
Conversation
  loadSpec = ImmutableMap.<String, Object>of(
      "type", "hdfs",
      "path", indexOutURI.getPath()
  );
} else {
-  throw new ISE("Unknown file system[%s]", outputFS.getClass());
+  throw new ISE("Unknown file system[%s] for URI[%s]", outputFS.getClass(), indexOutURI.toString());
Since we now check the URI scheme, it might be more useful to indicate the unknown scheme rather than the filesystem in the error message, e.g. "Unknown scheme[%s] for output URI[%s]".
Hi @davideanastasia, this is great. We have your personal CLA, do you know if you need to have a corporate one signed? If not, we can merge this I think.
Hi, my employer is happy for my patches to go under my personal CLA. Hence, I will keep using my personal email address to do the commits/pull requests. Thanks :)
-  if (outputFS instanceof NativeS3FileSystem) {
+  String indexOutScheme = indexOutURI.getScheme();
+
+  if ((indexOutScheme == null) || (indexOutScheme.compareToIgnoreCase("file") == 0)) {
Is it always true that a null scheme maps to the local filesystem? Could this depend on what the default Hadoop filesystem is set to? It would be good to double-check the behavior there.
It is null when the string is of the form /home/davide/hadoop. To maintain backwards compatibility, I have to assume that no scheme means the local filesystem.
Further, it shouldn't depend on Hadoop, as the URI class does not use the Configuration class in any way.
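To illustrate, a minimal standalone example of the java.net.URI behavior being described here (not Druid code):

```java
import java.net.URI;

public class SchemeCheck
{
  public static void main(String[] args)
  {
    // A bare path has no scheme component, so getScheme() returns null.
    System.out.println(URI.create("/home/davide/hadoop").getScheme());                  // null

    // An explicit scheme is returned as written; no Hadoop Configuration is consulted.
    System.out.println(URI.create("hdfs://namenode:8020/druid/segments").getScheme());  // hdfs
  }
}
```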
Hadoop interprets null schemes using fs.defaultFS, which might not be local, so I think this patch will break things for users who expect to provide "/foo/bar" and have the segments actually go to HDFS.
It should work to do FileSystem.get(new Path(schema.getIOConfig().getSegmentOutputPath()).toUri(), context.getConfiguration()) to get the right FileSystem for the base segment output path, then call config.makeSegmentOutputPath on that, and keep the instanceof checks that were originally in the code.
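A rough, self-contained sketch of that resolution step (class and variable names here are illustrative, not the actual IndexGeneratorJob code):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SegmentOutputFsExample
{
  /**
   * Resolve the FileSystem from the configured base segment output path. A bare
   * path such as "/foo/bar" has no scheme, so Hadoop falls back to fs.defaultFS
   * (which may be HDFS rather than the local filesystem), while paths like
   * "s3n://bucket/prefix" or "file:///tmp/segments" resolve to their own filesystems.
   */
  public static FileSystem resolveOutputFs(String segmentOutputPath, Configuration conf)
      throws IOException
  {
    return new Path(segmentOutputPath).getFileSystem(conf);
  }

  public static void main(String[] args) throws IOException
  {
    Configuration conf = new Configuration();
    System.out.println(resolveOutputFs("/foo/bar", conf).getClass());              // whatever fs.defaultFS points at
    System.out.println(resolveOutputFs("file:///tmp/segments", conf).getClass());  // LocalFileSystem
  }
}
```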
outputFS instanceof NativeS3FileSystem is false on AWS EMR even though you want to write to S3.
Agree with Gian: for us a "null" URI scheme means putting segments on HDFS, not the local filesystem. We never explicitly set "hdfs://.../" in the segmentOutputPath, just "/foo/bar".
@davideanastasia, would it work to add some or'ed checks to the s3_zip conditional, saying that we should do that if it's a NativeS3FileSystem or if finalIndexZipFilePath's scheme is "s3" or "s3n"? I think that should fix your problem but also not break the behavior for null schemes.
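As a sketch of the or'ed check being suggested (the helper name and exact surroundings are assumptions, not the actual patch):

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3native.NativeS3FileSystem;

public class S3OutputCheck
{
  // Treat the output as S3 either when the resolved FileSystem is the native S3
  // implementation, or when the output path itself carries an s3/s3n scheme, as it
  // does on EMR where the resolved FileSystem class is EmrFileSystem instead.
  public static boolean isS3Output(FileSystem outputFS, Path finalIndexZipFilePath)
  {
    String scheme = finalIndexZipFilePath.toUri().getScheme();
    return outputFS instanceof NativeS3FileSystem
           || "s3".equalsIgnoreCase(scheme)
           || "s3n".equalsIgnoreCase(scheme);
  }
}
```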
@gianm what you are proposing should work. Would you like me to do the change?
@davideanastasia, if you could, that would be awesome.
I have also been pushing for more and better URI support. A patch that started adding more standard URI usage in the code went in recently.
@gianm Done! Sorry for the long wait, I have been quite busy at work in the last week or so :(
Can you also fix the merge conflicts?
@nishantmonu51 wow, I didn't realize this piece of code had changed in this way. I have changed my patch to accommodate the new code structure, with the fully qualified class name (which I have taken straight from a valid EMR cluster I currently have live).
I actually started looking into this code path for other tasks, and I would like to propose using outputFs.getScheme() to determine the type of FS.
Ideally, omitting any sort of scheme in the URI would then simply use whatever method Hadoop uses to resolve URIs, and we can use the appropriate known schemes until Druid has more URI support.
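A sketch of what a scheme-based dispatch could look like, assuming FileSystem.getScheme() is implemented by the filesystems in play (only newer Hadoop 2.x releases implement it) and that "local" is the right loadSpec type for the local filesystem, both of which are assumptions:

```java
import org.apache.hadoop.fs.FileSystem;

public class LoadSpecTypeByScheme
{
  // Dispatch on the filesystem's scheme rather than its concrete class, so that
  // NativeS3FileSystem and EMR's EmrFileSystem can share the same S3 branch.
  public static String loadSpecType(FileSystem outputFS)
  {
    String scheme = outputFS.getScheme();  // e.g. "hdfs", "file", "s3n"
    switch (scheme) {
      case "hdfs":
        return "hdfs";
      case "file":
        return "local";
      case "s3":
      case "s3n":
        return "s3_zip";
      default:
        throw new IllegalStateException("Unknown scheme[" + scheme + "]");
    }
  }
}
```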
@davideanastasia, sorry for letting this fall by the wayside. The patch looks good and I'd like to merge it. Can you please fill out the Druid CLA at http://druid.io/community/cla.html? Don't worry about the merge conflicts; I can fix those when merging the patch.
@gianm you already have a CLA that I signed when I made a small pull request to PyDruid. Is that enough?
I believe this is solved by #1428, which uses the scheme instead of the class name.
Ah, okay, I see: druid-io/pydruid#12 (comment). That's enough. If you don't mind filling out the form, that'd still be nice, but it's not required. @drcrallen actually just pointed out to me that #1428 was recently merged and should solve this same problem in a different way. So I'm going to close this PR, but please let us know if you still have problems after running a version of Druid that includes #1428.
On EMR, the default FileSystem for S3 is com.amazon.ws.emr.hadoop.fs.EmrFileSystem, which is automatically configured with the access key and secret key during cluster startup. However, Druid expects to use NativeS3FileSystem to write to S3, which fails on EMR with the following stack trace:
Error: com.metamx.common.ISE: Unknown file system[class com.amazon.ws.emr.hadoop.fs.EmrFileSystem]
    at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.serializeOutIndex(IndexGeneratorJob.java:453)
    at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:387)
    at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:252)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:635)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
While this can be worked around with
-Dhadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
and several other options to set up the access key and secret key, this patch removes the problem entirely.
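For reference, that workaround amounts to forcing s3n:// URIs back onto the native S3 filesystem in the job configuration, roughly like this (a sketch of the workaround only, not the fix this patch proposes; credentials still have to be supplied separately):

```java
import org.apache.hadoop.conf.Configuration;

public class EmrS3nWorkaround
{
  // Equivalent of passing -Dhadoop.fs.s3n.impl=... : make s3n:// URIs resolve to
  // NativeS3FileSystem instead of EMR's EmrFileSystem for this job.
  public static Configuration withNativeS3n(Configuration conf)
  {
    conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
    return conf;
  }
}
```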